Smart Paper Notes

Feature-based model selection for object detection from point cloud data

Kairi Tokuda , Ryoichi Shinkuma , Takehiro Sato , Eiji Oki

Categories: Computer Vision | Machine Learning

2022-09-26

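Smart monitoring using three-dimensional (3D) image sensors has been attracting attention in the context of smart cities. In smart monitoring, object detection is performed on point cloud data acquired by 3D image sensors to detect moving objects, such as vehicles and pedestrians, and so ensure safety on the road. However, the features of point cloud data are diverse, owing to the characteristics of the light detection and ranging (LIDAR) units used as 3D image sensors and to the installation position of those sensors. Although a variety of deep learning (DL) models for object detection from point cloud data have been studied to date, no research has considered how to use multiple DL models in accordance with the features of the point cloud data. In this work, we propose a feature-based model selection framework that creates various DL models by using multiple DL methods and by utilizing training data with pseudo-incompleteness generated by two artificial techniques: sampling and noise addition. It selects the most suitable DL model for the object detection task in accordance with the features of the point cloud data acquired in the real environment. To demonstrate the effectiveness of the proposed framework, we compare the performance of multiple DL models using benchmark datasets created from the KITTI dataset and present example object detection results obtained through a real outdoor experiment. Depending on the situation, detection accuracy differs by up to 32% between DL models, which confirms the importance of selecting an appropriate DL model according to the situation.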

Translated by Google Translate
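
The two techniques named in the abstract above, sampling and noise addition, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_ratio` feature, and the nearest-feature selection rule are all assumptions made for illustration, and the abstract does not say which features the framework actually compares.

```python
import random

def downsample(points, keep_ratio, rng):
    """Sampling: randomly keep roughly keep_ratio of the points."""
    k = max(1, round(len(points) * keep_ratio))
    return rng.sample(points, k)

def add_noise(points, sigma, rng):
    """Noise addition: perturb every coordinate with zero-mean Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]

def select_model(observed_ratio, models):
    """Hypothetical selection rule: pick the model whose training data
    was degraded to the density closest to the observed one."""
    return min(models, key=lambda m: abs(m["keep_ratio"] - observed_ratio))

rng = random.Random(0)
cloud = [(float(i), 0.0, 1.0) for i in range(100)]  # toy point cloud (x, y, z)

# Build one pseudo-incomplete training sample: sparser and noisier.
sparse_noisy = add_noise(downsample(cloud, 0.25, rng), sigma=0.02, rng=rng)

# Hypothetical model registry, one entry per degradation setting.
models = [
    {"name": "trained_dense", "keep_ratio": 1.0},
    {"name": "trained_sparse", "keep_ratio": 0.25},
]
best = select_model(0.3, models)
```

Under these assumptions, a sensor that delivers about 30% of the reference point density would be routed to the model trained on the sparser data.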

Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism

Yukiya Hono , Kei Hashimoto , Yoshihiko Nankaku , Keiichi Tokuda

Categories: Machine Learning

2022-12-28

This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, due to the difficulty of the temporal modeling of singing voices, many recent SVS systems with an encoder-decoder-based model still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling using seq2seq models with an attention mechanism, their temporal modeling is not sufficiently robust. The proposed attention mechanism is designed to estimate the attention weights by considering the rhythm given by the musical score. Furthermore, several techniques are introduced to improve the modeling performance of the singing voice. Experimental results indicated that the proposed model is effective in terms of both naturalness and robustness of timing.

Seismic-phase detection using multiple deep learning models for global and local representations of waveforms

Tomoki Tokuda , Hiromichi Nagao

Categories: Machine Learning

2022-11-04

The detection of earthquakes is a fundamental prerequisite for seismology and contributes to various research areas, such as forecasting earthquakes and understanding the crust/mantle structure. Recent advances in machine learning technologies have enabled the automatic detection of earthquakes from waveform data. In particular, various state-of-the-art deep-learning methods have been applied to this endeavour. In this study, we proposed and tested a novel phase detection method employing deep learning, which is based on a standard convolutional neural network in a new framework. The novelty of the proposed method is its separate explicit learning strategy for global and local representations of waveforms, which enhances its robustness and flexibility. Prior to modelling the proposed method, we identified local representations of the waveform by the multiple clustering of waveforms, in which the data points were optimally partitioned. Based on this result, we considered a global representation and two local representations of the waveform. Subsequently, different phase detection models were trained for each global and local representation. For a new waveform, the overall phase probability was evaluated as a product of the phase probabilities of each model. This additional information on local representations makes the proposed method robust to noise, which is demonstrated by its application to the test data. Furthermore, an application to seismic swarm data demonstrated the robust performance of the proposed method compared with those of other deep learning methods. Finally, in an application to low-frequency earthquakes, we demonstrated the flexibility of the proposed method, which is readily adaptable for the detection of low-frequency earthquakes by retraining only a local model.
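
The combination rule stated in the abstract above, where the overall phase probability is the product of the per-model probabilities, can be sketched as follows. The probability traces, the detection threshold, and the function name are illustrative assumptions, not values from the paper.

```python
import math

def combine_phase_probabilities(traces):
    """Multiply per-model phase probabilities sample by sample.

    traces: one list of probabilities per model (global or local),
    all of the same length.
    """
    return [math.prod(sample) for sample in zip(*traces)]

# Hypothetical outputs of one global model and two local models
# for three consecutive time samples.
global_model = [0.10, 0.90, 0.80]
local_model_a = [0.90, 0.90, 0.30]
local_model_b = [0.80, 0.95, 0.90]

combined = combine_phase_probabilities([global_model, local_model_a, local_model_b])

# Assumed threshold; a sample counts as a detection only if every model agrees.
threshold = 0.5
picks = [i for i, p in enumerate(combined) if p >= threshold]
```

Because the scores are multiplied, a single low local probability (the third sample here) is enough to suppress a detection. This is one way to read the abstract's claim that the local representations add robustness to noise.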

End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

Kentaro Mitsui , Tianyu Zhao , Kei Sawada , Yukiya Hono , Yoshihiko Nankaku , Keiichi Tokuda

Categories: Natural Language Processing | Machine Learning

2022-06-24

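Recent text-to-speech (TTS) has achieved quality comparable to that of humans. However, its application to spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, a variational autoencoder (VAE)-VITS or a Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; both build on variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with the TTS. In the second stage, a style predictor is trained to predict, from the dialogue history, the speaking style to be synthesized. During inference, passing the speaking style representation predicted by the style predictor to VAE/GMVAE-VITS allows speech to be synthesized in a style appropriate to the dialogue context. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.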

Compound virtual screening by learning-to-rank with gradient boosting decision tree and enrichment-based cumulative gain

Kairi Furui , Masahito Ohue

Categories: Computer Vision | Machine Learning

2022-05-04

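Learning-to-rank, a machine learning technique widely used in information retrieval, has recently been applied to ligand-based virtual screening to accelerate the early stages of new drug development. Ranking prediction models learn from ordinal relationships, which makes them suitable for integrating assay data from various environments. Existing studies of ranking prediction in compound screening have generally used a learning-to-rank method called RankSVM. However, these have not been compared with or validated against gradient boosting decision tree (GBDT)-based learning-to-rank methods, which have recently gained popularity. Furthermore, although the ranking metric called normalized discounted cumulative gain (NDCG) is widely used in information retrieval, it only determines whether one model's predictions are better than those of other models. In other words, NDCG cannot recognize when a prediction model performs worse than random. Nevertheless, NDCG is still used to evaluate the performance of compound screening with learning-to-rank. This study used GBDT models with ranking loss functions, called Lambdarank and Lambdaloss, for ligand-based virtual screening, and compared the results with existing RankSVM methods and with GBDT models using regression. We also propose a new ranking metric, normalized enrichment discounted cumulative gain (NEDCG), which aims to properly evaluate the goodness of ranking predictions. The results show that the GBDT model with learning-to-rank outperformed existing regression methods using GBDT and RankSVM on diverse datasets. Moreover, NEDCG showed that predictions by regression were comparable to random predictions on multi-assay, multi-family datasets, which demonstrates its usefulness for a more direct assessment of compound screening performance.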
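
For context on the NDCG criticism in the abstract above, here is a minimal sketch of one common NDCG variant (linear gain, log2 discount). The relevance scores are made up for illustration. The proposed NEDCG is not defined in the abstract, so it is not reproduced here.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances_in_predicted_order):
    """DCG of the predicted order divided by DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0

# Hypothetical activity labels of four compounds, listed in predicted rank order.
perfect_ranking = [3, 2, 1, 0]
worst_ranking = [0, 1, 2, 3]

score_perfect = ndcg(perfect_ranking)
score_worst = ndcg(worst_ranking)
```

Even the fully reversed ranking scores well above zero under this definition. The raw value therefore gives no built-in reference point for random performance, which is consistent with the limitation the abstract describes.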

Weakly Supervised High-Fidelity Clothing Model Generation

Ruili Feng , Cheng Ma , Chengji Shen , Xin Gao , Zhenjiang Liu , Xiaobo Li , Kairi Ou , Zhengjun Zha

Categories: Computer Vision | Artificial Intelligence

2021-12-14

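The growth of the online economy has created demand for generating images of models wearing product clothes, in order to display new clothes and promote sales. However, expensive proprietary model images are a challenge for existing image-based virtual try-on methods in this scenario, because most of them need to be trained on a considerable number of model images accompanied by paired clothing images. In this paper, we propose a cheap yet scalable weakly supervised method called Deep Generative Projection (DGP) to address this specific scenario. At the heart of the proposed method is an imitation of how humans predict the wearing effect, which is an unsupervised imagination based on life experience rather than computational rules learned from supervision. Here, a pretrained StyleGAN is used to capture the practical experience of wearing. Experiments show that projecting a rough alignment of clothing and body onto the StyleGAN space can yield photo-realistic wearing results. Experiments on real-world proprietary model images demonstrate the superiority of DGP over several state-of-the-art supervised methods in generating clothing model images.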

Sidewalk Measurements from Satellite Images: Preliminary Findings

Maryam Hosseini , Iago B. Araujo , Hamed Yazdanpanah , Eric K. Tokuda , Fabio Miranda , Claudio T. Silva , Roberto M. Cesar Jr

Categories: Computer Vision

2021-12-12

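Large-scale analysis of pedestrian infrastructure, particularly sidewalks, is critical to human-centric urban planning and design. Benefiting from the rich dataset of planimetric features and high-resolution orthoimages provided through the New York City Open Data portal, we train a computer vision model to detect sidewalks, roads, and buildings from remote-sensing imagery, and achieve 83% mIoU on a held-out test set. We apply shape analysis techniques to study different attributes of the extracted sidewalks. More specifically, we perform a tile-wise analysis of the width, angle, and curvature of sidewalks, which, aside from their general impact on the walkability and accessibility of urban areas, are known to play a significant role in the mobility of wheelchair users. The preliminary results are promising and give a glimpse of the potential for the proposed approach to be adopted in different cities, enabling researchers and practitioners to obtain a more vivid picture of the pedestrian realm.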
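
The mIoU figure quoted in the abstract above refers to mean intersection over union across classes. A minimal sketch of that metric on flattened label maps follows. The class ids and toy labels are assumptions for illustration, and the paper's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(pred, truth, num_classes):
    """Average per-class IoU over classes that occur in pred or truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy flattened label maps; assumed ids: 0 = sidewalk, 1 = road, 2 = building.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

score = mean_iou(pred, truth, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.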

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

Yukiya Hono , Kazuna Tsuboi , Kei Sawada , Kei Hashimoto , Keiichiro Oura , Yoshihiko Nankaku , Keiichi Tokuda

Categories: Machine Learning

2020-09-17

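This paper proposes a hierarchical generative model with multi-grained latent variables to synthesize expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling fine control of the prosody and speaking style of synthesized speech. However, the naturalness of speech degrades when these latent variables are obtained by sampling from a standard Gaussian prior. To solve this problem, we propose a novel framework for modeling the fine-grained latent variables that considers their dependence on the input text, the hierarchical linguistic structure, and the temporal structure of the latent variables. The framework consists of a multi-grained variational autoencoder, a conditional prior, and a multi-level autoregressive latent converter. Together these obtain latent variables at different time resolutions and sample the finer-level latent variables from the coarser-level ones while taking the input text into account. Experimental results indicate an appropriate method for sampling fine-grained latent variables without a reference signal at the synthesis stage. Our proposed framework also provides controllability of the speaking style across an entire utterance.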
